TABLE 5.1
Performance of our quantization method on the WMT14 EN-DE and WMT14 EN-FR test set.

Model  Method              Precision    EN-DE                                EN-FR
                                        PPL    BLEU   Size (Gb)  Compr.      PPL    BLEU   Size (Gb)  Compr.
Base   Baseline            32-bit       4.95   26.46  2.02       1x          3.21   38.34  1.94       1x
       Default Approach    8-bit        74.04  0.21   0.52       3.91x       nan    0      0.50       3.91x
       Post-Quantization   8-bit        4.97   26.44  0.52       3.91x       3.26   38.30  0.50       3.91x
       FullyQT             8-bit        4.94   26.38  0.52       3.91x       3.23   38.41  0.50       3.91x
       Post-Quantization   6-bit        6.00   24.84  0.39       5.18x       3.98   35.02  0.37       5.17x
       FullyQT             6-bit        5.09   26.98  0.39       5.18x       3.38   37.07  0.37       5.17x
       FullyQT             4-bit        11.96  18.32  0.26       7.66x       48.21  1.59   0.25       7.64x
Big    Baseline            32-bit       4.38   27.13  6.85       1x          2.77   40.54  6.69       1x
       Post-Quantization   8-bit        4.27   26.55  1.74       3.95x       2.78   39.78  1.69       3.95x
       FullyQT             8-bit        4.57   26.96  1.74       3.95x       2.80   40.25  1.69       3.95x
       Post-Quantization   6-bit        5.12   24.86  1.31       5.24x       3.08   37.92  1.28       5.24x
       FullyQT             6-bit        4.78   26.76  1.31       5.24x       2.87   39.59  1.28       5.24x
       FullyQT             4-bit        33.11  10.22  0.88       7.79x       42.42  2.81   0.86       7.79x
for all weight matrices. For activations, they use tensor bucketing for the following ten-
sors: the sum of input embeddings with the positional encoding, the Q, K, V inputs, the
scaled dot-product attention’s output, the feed-forward’s output, the LayerNorm’s numer-
ator, quotient, and output.
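To make the bucketing idea concrete, the following is a minimal NumPy sketch of tensor bucketing for activations; the bucket count, the equal-split bucketing, and the 8-bit unsigned grid are illustrative assumptions rather than the paper's exact configuration. Each bucket gets its own (xmin, xmax) range, so the quantization step adapts to the local dynamic range instead of being dictated by the tensor-wide extremes.

```python
# Sketch of tensor bucketing for activation quantization (bucket count,
# rounding scheme, and 8-bit grid are illustrative assumptions).
import numpy as np

def bucketed_quantize(x, num_buckets=4, num_bits=8):
    """Simulate quantization of a tensor with a separate (xmin, xmax) per bucket.

    Assumes the tensor's size is divisible by num_buckets. Using several
    buckets instead of one per-tensor range keeps the step size small
    wherever the dynamic range differs across the tensor.
    """
    flat = x.reshape(num_buckets, -1)            # split into equal buckets
    levels = 2 ** num_bits - 1
    out = np.empty_like(flat)
    for i, bucket in enumerate(flat):
        xmin, xmax = bucket.min(), bucket.max()
        scale = (xmax - xmin) / levels or 1.0    # guard against a constant bucket
        q = np.round((bucket - xmin) / scale)    # integer grid in [0, levels]
        out[i] = q * scale + xmin                # dequantize (simulated quantization)
    return out.reshape(x.shape)

# Example: simulate 8-bit quantization of a feed-forward activation
acts = np.random.randn(64, 512).astype(np.float32)
sim = bucketed_quantize(acts, num_buckets=4, num_bits=8)
```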
5.2.4 Dealing with Zeros
Unlike the classic quantization method proposed in [104], they do not nudge the quantization domain so that the zero value maps exactly onto the integer grid. Specifically, the only tensors that contain zero values are the padding, the Softmax's numerator and output, the output of ReLU layers, and the zeros introduced by dropout. Since padding does not affect the final output, these values are ignored when quantizing. For ReLUs and the Softmax's numerator and output, the quantization parameter xmin is fixed to 0, which guarantees that zero is mapped exactly. Finally, quantization is applied before any dropout operation.
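As a toy illustration of this zero handling, the sketch below (the helper quantize_relu_output and the 8-bit grid are assumptions for illustration) pins xmin to 0 for a non-negative tensor; because the grid starts exactly at 0, every zero in the input survives quantization without error and no range nudging is required.

```python
# Minimal sketch of uniform quantization with xmin pinned to 0
# (bit width and rounding are illustrative assumptions).
import numpy as np

def quantize_relu_output(x, num_bits=8):
    """Simulate quantization of a non-negative tensor (e.g. a ReLU or Softmax output).

    Fixing xmin = 0 means the quantization grid starts exactly at 0, so every
    zero in the input is represented without error.
    """
    xmin = 0.0                                           # pinned, not estimated from data
    xmax = x.max()
    scale = (xmax - xmin) / (2 ** num_bits - 1) or 1.0   # guard against an all-zero tensor
    q = np.round((x - xmin) / scale).astype(np.uint8)    # 0.0 maps to integer 0
    return q * scale + xmin                              # dequantized values

relu_out = np.maximum(np.random.randn(4, 8), 0.0).astype(np.float32)
deq = quantize_relu_output(relu_out)
assert np.all(deq[relu_out == 0.0] == 0.0)               # zeros are reproduced exactly
```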
Table 5.1 shows the performance of the proposed method on WMT14 EN-DE and WMT14 EN-FR. They compare results with two full-precision Transformers: the base and big variants. Two other quantization approaches are evaluated. The first is the "default" approach, which naively quantizes every possible operation. The second applies the proposed quantization strategy post-training. In all cases except post-quantization, BLEU was computed on the test set using the checkpoint that scored the highest accuracy on the validation set. Towards the end of training, they ran one validation epoch for every 100 training steps. Baseline and FullyQT 8-bit results were averaged over 5 trials. The standard deviation of the BLEU scores was not noticeably higher for any method and ranged between 0.09 and 0.51. Training with quantization was about twice as slow as training the baselines. For post-training quantization, the BLEU score was computed on the test set using the best validation performance out of 20 trials. The default approach's nan in the EN-FR task is due to numerical instability: when every operation is quantized, zeros appear more frequently in the LayerNorm's denominator.
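This failure mode can be reproduced with a small numerical example; the quantizer below is a hypothetical stand-in for the default approach that simply keeps zero inside the quantization range, which is enough to show the effect: small standard deviations round down to exactly zero, and the subsequent division produces inf/nan.

```python
# Toy reproduction of the LayerNorm-denominator failure mode (the quantizer
# and the example values are illustrative assumptions, not the paper's exact setup).
import numpy as np

def naive_quantize_nonneg(x, num_bits=8):
    # per-tensor 8-bit grid over [0, max]: zero stays representable
    scale = x.max() / (2 ** num_bits - 1) or 1.0
    return np.round(x / scale) * scale

std = np.array([2.0, 0.5, 1e-3], dtype=np.float32)   # denominators; the last one is tiny
q_std = naive_quantize_nonneg(std)                    # 1e-3 rounds down to exactly 0.0
with np.errstate(divide="ignore"):
    print(1.0 / q_std)                                # [0.5, ~2.0, inf] -> nan downstream
```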
In summary, this paper's contributions are as follows: (1) a uniform quantization scheme; (2) a detailed justification of the choice of quantized layers; (3) a tensor bucketing method for achieving higher precision; and (4) a special treatment of zero values.